Recent successes illustrate the role of mass spectrometry-based proteomics as an indispensable tool for 
molecular and cellular biology and for the emerging field of systems biology. These include the study of 
protein-protein interactions via affinity-based isolations on a small and proteome-wide scale, the mapping of 
numerous organelles, the concurrent description of the malaria parasite genome and proteome, and the 
generation of quantitative protein profiles from diverse species. The ability of mass spectrometry to identify 
and, increasingly, to precisely quantify thousands of proteins from complex samples can be expected to 
impact broadly on biology and medicine. 



Proteomics in general deals with the large-scale 
determination of gene and cellular function 
directly at the protein level. But as the 
accompanying articles in this issue describe, 
the field is a collection of various technical 
disciplines, all of which contribute to proteomics. These 
include cell imaging by light and electron microscopy, 
array and chip experiments, and genetic readout 
experiments, as exemplified by the yeast two-hybrid assay. 
Another powerful proteomic approach focuses on the de 
novo analysis of proteins or protein populations isolated 
from cells or tissues. Such studies typically pose challenges 
owing to the high degree of complexity of cellular 
proteomes and the low abundance of many of the 
proteins, which necessitates highly sensitive analytical 
techniques. Mass spectrometry (MS) has increasingly 
become the method of choice for analysis of complex 
protein samples. MS-based proteomics is a discipline 
made possible by the availability of gene and genome 
sequence databases and technical and conceptual advances 
in many areas, most notably the discovery and 
development of protein ionization methods, as recognized 
by the 2002 Nobel prize in chemistry. 

Here we survey the state of the field, particularly as it has 
evolved over the three years since the last review in these 
pages 1 . Already, many of the dreams of the discipline have at 
least been partly realized. MS-based proteomics has 
established itself as an indispensable technology to interpret 
the information encoded in genomes. So far, protein 
analysis (primary sequence, post-translational modifica- 
tions (PTMs) or protein-protein interactions) by MS has 
been most successful when applied to small sets of proteins 
isolated in specific functional contexts. The systematic 
analysis of the much larger number of proteins expressed in 
a cell, an explicit goal of proteomics, is now also rapidly 
advancing, due mainly to the development of new experi- 
mental approaches. 

Today, proteomics still remains a multifaceted, rapidly 
developing and open-ended endeavour. Although it has 
enjoyed tremendous recent success, proteomics still faces 
significant technical challenges. Each breakthrough that 
either allows a new type of measurement or improves the 
quality of data made by traditional types of measurements 
expands the range of potential applications of MS to molec- 
ular and cellular biology. Indeed, this field is already too 
expansive for a comprehensive, single review; thus we apol- 
ogize in advance for the many omissions. However, we do 
hope that this article captures the excitement of recent 
achievements in MS-based proteomics, and points the way 
towards the direction future developments will likely take. 

Principles and instrumentation 

Mass spectrometric measurements are carried out in the gas 
phase on ionized analytes. By definition, a mass spectrome- 
ter consists of an ion source, a mass analyser that measures 
the mass-to-charge ratio (m/z) of the ionized analytes, and a 
detector that registers the number of ions at each m/z value. 
Electrospray ionization (ESI) and matrix-assisted laser 
desorption/ionization (MALDI) are the two techniques 
most commonly used to volatilize and ionize the proteins or 
peptides for mass spectrometric analysis 2,3 . ESI ionizes the 
analytes out of a solution and is therefore readily coupled to 
liquid-based (for example, chromatographic and 
electrophoretic) separation tools (Fig. 1). MALDI sublimates 
and ionizes the samples out of a dry, crystalline matrix 
via laser pulses. MALDI-MS is normally used to analyse 
relatively simple peptide mixtures, whereas integrated 
liquid-chromatography ESI-MS systems (LC-MS) are 
preferred for the analysis of complex samples. 
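
As a rough illustration of the quantity a mass analyser reports, the sketch below computes the m/z value of a peptide that carries one or more protons, as is typical after ESI. It assumes positive-mode protonation only; the function name `mz` and the example masses are illustrative choices and are not taken from this review.

```python
# Minimal sketch: m/z of a protonated peptide ion, [M + zH]^z+.
# Assumption: charge arises only from added protons (positive mode).

PROTON_MASS = 1.007276  # mass of a proton in daltons


def mz(neutral_mass: float, charge: int) -> float:
    """Return m/z for a peptide of the given neutral monoisotopic mass
    carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


if __name__ == "__main__":
    # The same peptide appears at different m/z values depending on its charge.
    for z in (1, 2, 3):
        print(z, round(mz(1000.0, z), 4))
```

A doubly protonated peptide therefore appears at roughly half the m/z of its singly protonated form, which is why charge state must be known before a peptide mass can be read from a spectrum.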

The mass analyser is, literally and figuratively, central to 
the technology. In the context of proteomics, its key parame- 
ters are sensitivity, resolution, mass accuracy and the ability 
to generate information-rich ion mass spectra from peptide 
fragments (tandem mass or MS/MS spectra) (see Fig. 1 and 
refs 1,4,5). There are four basic types of mass analyser cur- 
rently used in proteomics research. These are the ion trap, 
time-of-flight (TOF), quadrupole and Fourier transform 
ion cyclotron (FT-MS) analysers. They are very different in 
design and performance, each with its own strengths and 
weaknesses. These analysers can be stand-alone or, in some 
cases, put together in tandem to take advantage of the 
strengths of each (Fig. 2). 

In ion-trap analysers, the ions are first captured or 
'trapped' for a certain time interval and are then subjected to 
MS or MS/MS analysis. Ion traps are robust, sensitive and 
relatively inexpensive, and so have produced much of the 
proteomics data reported in the literature. A disadvantage of 
ion traps is their relatively low mass accuracy, due in part to 
the limited number of ions that can be accumulated at their 
point-like centre before space-charging distorts their 
distribution and thus the accuracy of the mass measurement. 
The 'linear' or 'two-dimensional' ion trap 6,7 is an 
exciting recent development where ions are stored in a 
cylindrical volume that is considerably larger than that of 
the traditional, three-dimensional ion traps, allowing increased 
sensitivity, resolution and mass accuracy. The FT-MS instrument is 
also a trapping mass spectrometer, although it captures the ions 
under high vacuum in a high magnetic field. Its strengths are high 
sensitivity, mass accuracy, resolution and dynamic range 8-11 . But in 
spite of the enormous potential, the expense, operational complexity 
and low peptide-fragmentation efficiency of FT-MS instruments have 
limited their routine use in proteomics research. 

Figure 1 Generic mass spectrometry (MS)-based proteomics experiment. The typical 
proteomics experiment consists of five stages. In stage 1, the proteins to be analysed 
are isolated from cell lysate or tissues by biochemical fractionation or affinity selection. 
This often includes a final step of one-dimensional gel electrophoresis, and defines the 
'sub-proteome' to be analysed. MS of whole proteins is less sensitive than peptide MS 
and the mass of the intact protein by itself is insufficient for identification. Therefore, 
proteins are degraded enzymatically to peptides in stage 2, usually by trypsin, leading 
to peptides with C-terminally protonated amino acids, providing an advantage in 
subsequent peptide sequencing. In stage 3, the peptides are separated by one or more 
steps of high-pressure liquid chromatography in very fine capillaries and eluted into an 
electrospray ion source where they are nebulized in small, highly charged droplets. 
After evaporation, multiply protonated peptides enter the mass spectrometer and, in 
stage 4, a mass spectrum of the peptides eluting at this time point is taken (MS1 
spectrum, or 'normal mass spectrum'). The computer generates a prioritized list of 
these peptides for fragmentation and a series of tandem mass spectrometric or 
'MS/MS' experiments ensues (stage 5). These consist of isolation of a given peptide 
ion, fragmentation by energetic collision with gas, and recording of the tandem or 
MS/MS spectrum. The MS and MS/MS spectra are typically acquired for about one 
second each and stored for matching against protein sequence databases. The 
outcome of the experiment is the identity of the peptides and therefore the proteins 
making up the purified protein population. 

MALDI is usually coupled to TOF analysers that measure the mass 
of intact peptides, whereas ESI has mostly been coupled to ion traps 
and triple quadrupole instruments and used to generate fragment ion 
spectra (collision-induced dissociation (CID) spectra) of selected precursor ions 4 . 
More recently, new configurations of ion sources and mass analysers 
have found wide application for protein analysis. To allow the 
fragmentation of MALDI-generated precursor ions, MALDI ion 
sources have recently been coupled to quadrupole ion-trap mass 
spectrometers 12 and to two types of TOF instruments. In the first, two TOF 
sections are separated by a collision cell ('TOF-TOF instrument') 13 , 
whereas in the second, the hybrid quadrupole TOF instrument, the 
collision cell is placed between a quadrupole mass filter and a TOF 
analyser 14 . Ions of a particular m/z are selected in a first mass analyser 
(TOF or quadrupole), fragmented in a collision cell and the fragment 
ion masses are 'read out' by a TOF analyser. These instruments have 
high sensitivity, resolution and mass accuracy, and the quadrupole 
TOF instrument can be used interchangeably with an ESI ionization 
source. The resulting fragment ion spectra are often more extensive 
and informative than those generated in trapping instruments. 
Although TOF, ion-trap and hybrid TOF instruments dominate 
proteomics today, other configurations including linear ion traps and 
FT-MS instruments could become widespread in the near future. 

As a result of its simplicity, excellent mass accuracy, high 
resolution and sensitivity, MALDI-TOF is still much used to identify 
proteins by what is known as peptide mapping, also referred to as 
peptide-mass mapping or peptide-mass fingerprinting. In this 
method, proteins are identified by matching a list of experimental 
peptide masses with the calculated list of all peptide masses of each 
entry in a database (for example, a comprehensive protein database). 
Because mass mapping requires an essentially purified target 
protein, the technique is commonly used in conjunction with prior 
protein fractionation using either one- or two-dimensional gel 
electrophoresis (1DE and 2DE, respectively). The addition of 
sequencing capability to the MALDI method should make protein 
identifications by MALDI-MS/MS more specific than those obtained 
by simple peptide-mass mapping (see below). It should also extend 
the use of MALDI to the analysis of more complex samples, thereby 
uncoupling MALDI-MS from 2DE. However, if MALDI-MS/MS is 
to be used with peptide chromatography, the effluent of a liquid 
chromatography run must be deposited on a sample plate and mixed 
with the MALDI matrix, a process that has thus far proven difficult to 
automate. In general, it can be expected that the trend towards the 
combination of liquid chromatography with ESI- or MALDI- 
MS/MS (Fig. 1) will continue. 
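
The peptide-mass mapping procedure described above can be sketched in a few lines. This is a simplified illustration and not the algorithm of any particular search program: it assumes complete tryptic cleavage (after K or R, but not before P), unmodified residues, and observed masses already converted to neutral monoisotopic values. The function names, the toy database and the 0.2 Da tolerance are all hypothetical.

```python
# Minimal sketch of peptide-mass fingerprinting under the assumptions above.

# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence):
    """Split a protein sequence after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (residue in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def count_matches(observed, protein_sequence, tol=0.2):
    """Number of observed masses explained by some tryptic peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein_sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


def rank_proteins(observed, database, tol=0.2):
    """Rank database entries (name -> sequence) by number of matched masses."""
    scores = {name: count_matches(observed, seq, tol)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Counting matches is the crudest possible score; it illustrates why the approach needs an essentially purified protein, since peptides from a second protein in the sample would simply appear as unexplained masses.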

Protein identifications using peptide CID spectra are more clear-cut 
than those achieved by mass mapping because, in addition to the 
peptide mass, the peak pattern in the CID spectrum also provides 
information about peptide sequence. This information is not readily 
convertible into a full, unambiguous peptide sequence; that is, the 'de 
novo' sequencing problem via MS is still not generally solved. Instead, 
the CID spectra are scanned against comprehensive protein sequence 
databases using one of a number of different algorithms, each with its 
strengths and weaknesses. The 'peptide sequence tag' approach 
extracts a short, unambiguous amino acid sequence from the peak 
pattern that, when combined with the mass information, is a 
specific probe to determine the origin of the peptide 15 . In the 
cross-correlation method, peptide sequences in the database are 
used to construct theoretical mass spectra and the overlap or 
'cross-correlation' of these predicted spectra with the measured mass 
spectra determines the best match 16 . In the third main approach, 
'probability-based matching', the calculated fragments from peptide 
sequences in the database are compared with observed peaks. From 
this comparison a score is calculated which reflects the statistical 
significance of the match between the spectrum and the sequences 
contained in a database 17 . 
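
The common idea behind these database-search methods, comparing calculated fragments of candidate peptides with observed peaks, can be illustrated with a deliberately simple shared-peak count. This sketch is not the cross-correlation or probability-based scoring of the cited methods. It assumes singly charged b- and y-type fragment ions and unmodified residues, and the function names and 0.5 m/z tolerance are hypothetical.

```python
# Minimal sketch: score candidate peptides against a CID spectrum by
# counting calculated b/y fragment ions that coincide with observed peaks.

# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z values.

    b ions hold the N-terminal residues; y ions hold the C-terminal
    residues plus one water. Lists are ordered by cleavage position.
    """
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(RESIDUE_MASS[r] for r in peptide[:i]) + PROTON)
        y_ions.append(sum(RESIDUE_MASS[r] for r in peptide[i:]) + WATER + PROTON)
    return b_ions, y_ions


def shared_peak_count(observed_mz, peptide, tol=0.5):
    """Count calculated fragments lying within `tol` of an observed peak."""
    b_ions, y_ions = fragment_ions(peptide)
    return sum(any(abs(o - t) <= tol for o in observed_mz)
               for t in b_ions + y_ions)


def best_match(observed_mz, candidates, tol=0.5):
    """Return the candidate peptide with the highest shared-peak count."""
    return max(candidates, key=lambda p: shared_peak_count(observed_mz, p, tol))
```

In practice the candidates would first be restricted to database peptides whose calculated mass agrees with the measured precursor mass, and the raw count would be replaced by a score that accounts for chance matches.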

In each of these methods the identified peptides are compiled 
into a protein 'hit list', which is the output of a typical proteomic 
experiment. Because protein identifications rely on matches with 
sequence databases, high-throughput proteomics is currently 
restricted largely to those species for which comprehensive sequence 
databases are available. 

Protein identification and quantification 

No method or instrument exists that is capable of identifying and 
quantifying the components of a complex protein sample in a simple, 
single-step operation. Rather, individual components for separating, 
identifying and quantifying the polypeptides as well as tools for 
integrating and analysing all the data must be used in concert. Out of a 
bewildering multitude of techniques and instruments, two main tracks 
can be identified. The first, and most commonly used, is a combination 
of 2DE and MS. The second track combines limited protein 
purification with the more recently developed techniques of automated 
peptide MS/MS and, if accurate quantification is desired, 
stable-isotope tagging of proteins or peptides. In either track a suitable data 
processing, storage and visualization infrastructure needs to be 
developed if the platform is intended for high-throughput operation. 

Figure 2 Mass spectrometers used in proteome research. The left and right upper 
panels depict the ionization and sample introduction process in electrospray ionization 
(ESI) and matrix-assisted laser desorption/ionization (MALDI). The different instrumental 
configurations (a-f) are shown with their typical ion source. a, In reflector time-of-flight 
(TOF) instruments, the ions are accelerated to high kinetic energy and are separated 
along a flight tube as a result of their different velocities. The ions are turned around in a 
reflector, which compensates for slight differences in kinetic energy, and then impinge on 
a detector that amplifies and counts arriving ions. b, The TOF-TOF instrument 
incorporates a collision cell between two TOF sections. Ions of one mass-to-charge (m/z) 
ratio are selected in the first TOF section, fragmented in the collision cell, and the masses 
of the fragments are separated in the second TOF section. c, Quadrupole mass 
spectrometers select by time-varying electric fields between four rods, which permit a 
stable trajectory only for ions of a particular desired m/z. Again, ions of a particular m/z 
are selected in a first section (Q1), fragmented in a collision cell (q2), and the fragments 
separated in Q3. In the linear ion trap, ions are captured in a quadrupole section, depicted 
by the red dot in Q3. They are then excited via resonant electric field and the fragments 
are scanned out, creating the tandem mass spectrum. d, The quadrupole TOF 
instrument combines the front part of a triple quadrupole instrument with a reflector TOF 
section for measuring the mass of the ions. e, The (three-dimensional) ion trap captures 
the ions as in the case of the linear ion trap, fragments ions of a particular m/z, and then 
scans out the fragments to generate the tandem mass spectrum. f, The FT-MS 
instrument also traps the ions, but does so with the help of strong magnetic fields. The 
figure shows the combination of FT-MS with the linear ion trap for efficient isolation, 
fragmentation and fragment detection in the FT-MS section. 

In the first track, the proteins in a sample are separated by 2DE, 
stained, and each observed protein spot is quantified by its staining 
intensity. Selected spots are excised, digested and analysed by MS. 
Sophisticated pattern-matching algorithms as well as interpretation 
by skilled researchers are required to relate the 2DE patterns to each 
other in order to detect characteristic patterns and differences among 
samples. 2DE has been a mature technique for more than 25 years and 
was the first technique capable of supporting the concurrent quanti- 
tative analysis of large numbers of gene products. In fact, many of the 
principles now commonly used for global, quantitative analysis of 
messenger RNA expression patterns, such as clustering algorithms 
and multivariate statistics, were developed in the context of 2DE 18 . 

Peptide-mass mapping by MALDI-TOF and peptide sequencing 
by ESI-MS/MS have become highly efficient at the identification of 
gel-separated proteins. In the many reports using this technology, 
largely the same proteins were identified repeatedly, irrespective of 
the system studied, which suggests limited dynamic range of 
2DE-based proteomics. Systematic studies of the budding yeast 
Saccharomyces cerevisiae indeed revealed that typically only the most 
abundant proteins can be observed by this method. Incremental 
improvements in 2DE technology, including more sensitive staining 
methods 20,21 , large-format higher resolving gels 22 and sample 
fractionation prior to 2DE have alleviated, but not eliminated, these and 
other shortcomings of the 2DE/MS approach. 

Studying major histocompatibility complex class I-associated 
peptides, a natural and complex peptide library, Hunt and colleagues 
pioneered the use of LC-MS/MS for the analysis of complex peptide 
mixtures and it is this method that is today at the core of MS-based 
proteomics 23 . However, before LC-MS/MS could be used both for the 
identification of protein mixtures and for quantitative proteomic 
experiments, a number of technical issues had to be addressed. 
First, single-dimension peptide chromatography does not provide 
sufficient peak capacity to separate peptide mixtures as complex as 
those generated by the proteolysis of protein mixtures of, for 
example, total cell lysates. Second, in both MALDI- and ESI-MS, the 
relationship between the amount of analyte present and measured 
signal intensity is complex and incompletely understood. Mass 
spectrometers are therefore inherently poor quantitative devices. 
Third, the amount of data collected by the method is huge and its 
analysis daunting. 

Substantial progress has been achieved in each of these areas, 
resulting in the emergence of increasingly robust and productive 
platforms. To provide more peak capacity, various combinations of 
protein and peptide separation schemes have been explored. Most 
popular at present are two-dimensional (strong cation 
exchange/reversed phase) 24,25 or three-dimensional (strong cation 
exchange/avidin/reversed phase) 28 chromatographic separations of 
peptide mixtures generated by tryptic digestion of protein samples 
that are frequently pre-fractionated by 1DE. Several studies suggest 
that, in principle, these methods are capable of detecting proteins of 
very low abundance, although considerable effort is required and a 
sufficient amount of starting protein sample must be available 27,28 . 
However, no proteome has yet been completely analysed and, for lack 
of a suitable reference, it will be difficult to determine when that 
milestone has been achieved. 

To add a quantitative dimension to peptide LC-MS/MS experi- 
ments, the proven technique of stable-isotope dilution has been 
applied. This method makes use of the facts that pairs of chemically 
identical analytes of different stable-isotope composition can be 
differentiated in a mass spectrometer owing to their mass difference, 
and that the ratio of signal intensities for such analyte pairs accurately 
indicates the abundance ratio for the two analytes (Fig. 3). To this 
end, stable-isotope tags have been introduced to proteins via 
metabolic labelling using heavy salts or amino acids 29 , enzymatically 
via transfer of 18O from water to peptides 30,31 , or via chemical 
reactions using isotope-coded affinity tags or similar reagents 32,33 . 
Post-isolation chemical isotope tagging of proteins is currently the 
most versatile and most commonly used labelling method. An 
attractive feature of this approach is that the selectivity of the 
labelling reactions can be used to direct the isotopes and attached 
affinity tags to specific functional groups or protein classes, thus 
enabling their selective isolation and analysis. 
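
The quantitative principle of stable-isotope dilution, reading relative abundance from the intensity ratio of a light/heavy peptide pair, can be sketched as follows. Summarizing several peptide ratios into one protein-level value by taking the median is an assumption made here for illustration, not a procedure prescribed in this review, and the function names are hypothetical.

```python
# Minimal sketch: relative quantification from isotope-labelled peptide pairs.
from statistics import median


def peptide_ratio(light_intensity, heavy_intensity):
    """Heavy-to-light signal ratio for one chemically identical peptide pair."""
    if light_intensity <= 0:
        raise ValueError("light intensity must be positive")
    return heavy_intensity / light_intensity


def protein_ratio(pairs):
    """Summarize (light, heavy) intensity pairs from one protein.

    The median is used as a simple, outlier-tolerant summary; this choice
    is an assumption of the sketch.
    """
    return median(peptide_ratio(light, heavy) for light, heavy in pairs)


if __name__ == "__main__":
    # Three peptide pairs attributed to the same protein (made-up numbers).
    pairs = [(100.0, 200.0), (50.0, 110.0), (80.0, 150.0)]
    print(protein_ratio(pairs))
```

Because both members of a pair ionize alike, the many factors that make absolute signal intensity unreliable largely cancel in the ratio.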

So far, isotope-tagging chemistries have been described that are 
specific for sulphydryl groups 32,33 , amino groups 34 , the active sites for 
serine 35 and cysteine hydrolases 36 , for phosphate ester groups 37,38 and 
for N-linked carbohydrates 39 . Site-specific isotope tagging is limited 
only by the creativity of the chemist synthesizing suitable reagents. 
We therefore expect that new reagents will make many different types 
of 'sub-proteomes' accessible to quantitative analysis. Recently, a 
method called 'stable-isotope labelling with amino acids in cell 
culture', or SILAC, has been described 40 . In this method, one cell state 
is metabolically labelled by, for example, 13C-labelled arginine. 
Potentially all peptides can be labelled and the absence of any chemical 
steps makes the method easy to apply as well as compatible with 
multistage purification procedures. 

A current challenge for high-throughput proteomics is to use CID 
database search results from large numbers of peptide CID spectra to 
derive a list of identified peptides and their corresponding proteins. 
This task entails distinguishing correct peptide assignments from 
false identifications among database search results. In the case of 
small data sets, this can be achieved by researchers with expertise in 
spectral interpretation, manually verifying the peptide assignments 
to spectra made by database search programs. Such a time-consum- 
ing approach is not feasible for high-throughput analysis of large data 
sets containing tens of thousands of spectra, or when expertise is not 
available. 

Alternatively, researchers can attempt to separate the correct from 
incorrect peptide assignments by applying filtering criteria based 
upon database search scores and other available data 25,26,28 . However, 
the rates of false identifications that result from such filters are not 
known, nor is it known how those rates are affected by mass spec- 
trometer, sample preparation, or spectrum quality. In addition, 
researchers often use their own preferred filtering criteria, making it 
particularly difficult to compare their results among or even within 
groups. Consequently, the question of what constitutes an identified 
protein in an LC-MS/MS experiment has been difficult to answer. It is 
therefore important that computer programs that use robust and 
transparent statistical principles to estimate accurate probabilities 
indicating the likelihood for the presence of a peptide or protein in 
the sample 41,42 are further developed and widely tested and applied. 

The technologies and tools described here are now being 
combined to create robust platforms for quantitative, high-through- 
put proteomics. This effort is aided by the introduction of the 
new types of high-performance mass spectrometers discussed 
above. Currently, specialized MS laboratories can easily identify 
and quantify hundreds of proteins per day on a single MS system, and 
rapid advances in sample throughput, sensitivity and accuracy are 
projected. 

Applying proteomics technology to protein profiling 

Protein mixtures of considerable complexity can now be routinely 
characterized in some depth using the methods described above. One 
measure of technical progress is the number of proteins identified in 
each study. Such numbers can now reach into the thousands for suit- 
ably complex samples. But to be biologically useful, as opposed to 
simply highlighting analytical features of the methods, large-scale 
proteomic studies need to solve biological questions. In this regard, 
MS-based proteomics has interfaced particularly well with three 
types of biological or clinical questions. The first is the generation of 
protein-protein linkage maps. The second is the use of protein iden- 
tification technology to annotate and, if necessary, correct genomic 
DNA sequences. The third is the use of quantitative methods to 
analyse protein expression profiles as a function of cellular state as an 
aid to infer cellular function. 

The sequences of many mature proteins in higher eukaryotes, 
after processing and splicing, are often not directly apparent from 
their cognate DNA sequences. Peptide sequence data of sufficient 
quality provides unambiguous evidence of translation of a particular 
gene and can, in principle, differentiate between alternatively spliced 
or translated forms of a protein. Using a combination of MS and gene 
chip analysis, a number of proteins that were derived from previously 
undetected open-reading frames were found in the yeast genome 43 
and previously unknown human genes have also been found by 
direct searching of the human genome sequence 44,45 . 

Thus, it might be tempting to systematically analyse the proteins 
expressed by a cell or tissue, that is, to generate comprehensive 
proteome maps. First-generation large-scale proteome maps of 
microorganisms such as yeast 28 or the bacterium Deinococcus 
radiodurans 11 are examples of such projects and, with products from more 
than 60% of the genes identified, the Deinococcus map is at present 
the most complete. A recent review 46 of the proteomics of human 
plasma highlights a number of challenges facing comprehensive 
blood-serum analysis and thus, by implication, other samples from 
higher eukaryotes. Considering the combinatorial effects of splicing, 
processing and PTMs, plasma is estimated to contain many 
thousands to perhaps millions of polypeptide species, spanning a 
concentration range of up to 10 orders of magnitude. The fact that 
only about 500 proteins have so far been reported 47 , and very few 
have been quantified, illustrates the need for further technological 
developments to address these issues. 

The more common and versatile use of large-scale MS-based 
proteomics has been to document the expression of proteins as a 
function of cell or tissue state. We argue that to be meaningful, such 
data must be at least semi-quantitative and that a simple list of 
proteins detected in the different states is insufficient. This is because 
analyses of complex mixtures are often not comprehensive, and 
therefore the non-appearance of a particular sequence in the list of 
identified peptides does not indicate that the peptide or protein was 
not originally present in the sample. Additionally, it is often impossi- 
ble to prepare a certain cell type, cell fraction or tissue in completely 
pure form, without trace contaminations of other fractions. And 
because the ion current of a peptide is dependent on a multitude of 
variables that are difficult to control, this measure is not a good indi- 
cator of peptide abundance. If stable-isotope dilution has not been 
used, a rough relative estimate of the quantity of the protein can be 
gained by integrating the ion current of its peptide-mass peaks over 
their elution time and comparing these 'extracted ion currents' 
between states, provided that highly accurate and reproducible 
methods are used. 
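
The 'extracted ion current' estimate mentioned above amounts to integrating a peptide's signal over its elution time and comparing the resulting areas between states. The sketch below uses trapezoidal integration, which is one reasonable choice and is assumed here. The function names and the example traces are hypothetical.

```python
# Minimal sketch: label-free relative estimate from extracted ion currents.


def extracted_ion_current(times, intensities):
    """Integrate intensity over elution time using the trapezoidal rule."""
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have equal length")
    area = 0.0
    for i in range(1, len(times)):
        mean_height = 0.5 * (intensities[i] + intensities[i - 1])
        area += mean_height * (times[i] - times[i - 1])
    return area


def relative_abundance(trace_a, trace_b):
    """Ratio of integrated ion currents for the same peptide in two states.

    Each trace is a (times, intensities) pair.
    """
    return extracted_ion_current(*trace_a) / extracted_ion_current(*trace_b)


if __name__ == "__main__":
    state_1 = ([0.0, 1.0, 2.0, 3.0], [0.0, 10.0, 10.0, 0.0])
    state_2 = ([0.0, 1.0, 2.0, 3.0], [0.0, 5.0, 5.0, 0.0])
    print(relative_abundance(state_1, state_2))
```

Such a ratio is only as trustworthy as the reproducibility of the chromatography and ionization between the two runs, which is the caveat the text attaches to this approach.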

The malaria parasite Plasmodium falciparum has recently been 
subjected to detailed proteomic analysis. The life cycle of the parasite 
is complex; thus there is great interest in the proteins it expresses in its 
own different stages and in its different host compartments. Illustrat- 
ing the power and importance of proteomics, the recent genome 
project of the malaria parasite was accompanied by two large-scale 
proteome efforts. In one of the studies 48 the human stages of the 
parasite were analysed and a large number of proteins identified in 
the sexual and non-sexual stages. Quantitation was attempted by 
comparing peptide ion currents between stages and by correlation of 
protein data with RNA quantification by the polymerase chain reac- 
tion. After bioinformatic analysis, a set of proteins was selected from 
membrane fractions for follow-up as possible stage-specific drug or 
vaccine targets. The study resulted in more than 200 such candidate 
proteins and generated a large set of 'orphan' peptides that were not 
found in the set of predicted proteins, but were mapped onto the 
genome, assisting its annotation. 

The other P. falciparum proteomics project analysed mosquito 
and human stages of the parasite and reported a total of around 2,400 
identified proteins 49 . The study revealed unexpected stage specificity 
of a number of surface proteins and suggested co-expression of new 
proteins with groups of proteins already annotated as stage specific, 
helping to place these proteins in a functional context. The study also 
illustrated the need for transparent statistical tools to improve the 
confidence in protein identifications, as a large number, and in 
some cases the majority, of the proteins were identified solely by 
single peptides, many of which did not conform to the expected 
tryptic cleavage pattern. 

Increasingly, stable-isotope dilution and LC-MS/MS are used to 
accurately detect changes in quantitative protein profiles and to infer 
biological function from the observed patterns. Shiio et al. 50 identified 
the reduction of stress fibres and focal adhesions as a new cellular 
function of the Myc oncogene by comparing protein extracts from 
Myc+ and Myc- cells. Han et al. identified pleiotropic, 
differentiation-induced effects in the microsomal compartment of phorbol 
ester-treated HL-60 cells, and a number of studies have also identified 
previously unknown connections between metabolic processes 51,52 . 

Applying proteomics technology to protein interactions 

The analysis of protein complexes is the third area where MS-based 
proteomics has had a significant impact. Most proteins exert their 
function by way of protein-protein interactions and enzymes are often 
held in tightly controlled regions of the cell by such interactions. Thus, 
one of the first questions usually asked about a new protein — apart 
from where it is expressed — is to what proteins does it bind? To study 
this question by MS, the protein itself is used as an affinity reagent to 
isolate its binding partners. Compared with two-hybrid and chip- 
based approaches, this strategy has the advantages that the fully 
processed and modified protein can serve as the bait, that the interac- 
tions take place in the native environment and cellular location, and 
that multicomponent complexes can be isolated and analysed in a 
single operation 53 . However, because many biologically relevant 
interactions are of low affinity, transient and generally dependent on 
the specific cellular environment in which they occur, MS-based 
methods in a straightforward affinity experiment will detect only a subset of 
the protein interactions that actually occur. Bioinformatics methods, 
correlation of MS data with those obtained by other methods, or 
iterative MS measurements possibly in conjunction with chemical 
crosslinking 54 can often help to further elucidate direct interactions 
and overall topology of multiprotein complexes. 

MS-based protein interaction experiments have three essential 
components: bait presentation, affinity purification of the complex, 
and analysis of the bound proteins. Ideally, endogenous proteins can 
serve as bait if an antibody or other reagent exists that allows specific 
isolation of the protein with its bound partners. Unfortunately, there 
are currently no comprehensive antibody collections and many
current antibodies do not immunoprecipitate well or lack sufficient
specificity. A more generic strategy is to 'tag' the proteins of interest
with a sequence readily recognized by an antibody specific for the tag.
To facilitate expression of the tagged protein at close to physiological
levels, the tagged construct is preferably expressed from the promoter 
of its native, untagged counterpart. This can be achieved in a limited 
number of species, most notably S. cerevisiae, by using homologous
recombination to replace the endogenous gene in the genome with a 
gene coding the tagged protein. 

In mammalian cells, where expression of tagged proteins from the 
native promoter is more difficult, they are usually expressed after
transient transfection, in stable cell lines generated by traditional
selection, or by recently introduced kits for fast generation of stable
cell lines. Transient or stable transfections usually result in
tagged-protein expression levels that differ from those of the untagged,
endogenous counterpart. They are therefore prone to artefacts
generated by non-physiological levels of the bait protein. Considerable
efforts have been devoted to developing tagging systems optimized
for analysis of protein complexes (see review in this issue by Fields
and co-workers, page 208). Tags supporting single-step purification
have the advantage of convenience and yield. Tags supporting two 
sequential affinity steps (tandem affinity purification or TAP) 
combine two different tags on the same protein, which are normally 
separated by an enzyme-cleavable linker sequence. 

A popular implementation of this concept consists of a
calmodulin-binding domain in series with the immunoglobulin-binding
domain of protein A, the domains being separated by a sequence that
can be cleaved by a tobacco etch virus (TEV) protease 55 . The tagged
proteins are bound initially to a solid support modified with 
immunoglobulins, recovered by TEV proteolysis and bound to a 
calmodulin column from which they can be selectively eluted by 
increased [Ca2+]. TAP tags significantly reduce background noise,
but probably result in the loss of some of the more transient and weak 
binding partners during the purification procedure, as the second 
affinity step essentially causes infinite sample dilution. The 
identification part of the strategy is similar to the generic protein 
identification experiment described above, and essentially all the
strategies discussed here have been used for the analysis of protein
complexes. However, it is clear that the cleaner the initial
purification, the less challenging the mass spectrometry 'readout' becomes.

Combining such developments, two large-scale projects have 
recently been reported on the protein-protein interaction network in 
yeast. In one of the studies, 1,739 TAP-tagged genes were introduced
into the yeast genome by homologous recombination, 232 stable
complexes were isolated and protein constituents were identified by
MALDI peptide mapping after separation by 1DE 56 . Apart from the
large number of new interactions for known and new yeast proteins, a
higher-order interaction structure between complexes emerged 
from the data. A similar study used transient transfection to express 
FLAG-tagged bait proteins; complexes were isolated by single-step 
immunopurification and any attached proteins identified by
automated LC-MS/MS of gel-separated bands 57 . These experiments
probed the phosphatases, kinases and the DNA-repair network of
yeast specifically, resulting in many interesting signalling
connections being made.

Both studies reported a large number of interacting proteins, and
while groups of selected bait proteins were only partly overlapping, a
number of interesting conclusions could be drawn from a comparison
of the results. First, protein complexes isolated in a single step
resulted in more complex samples than those isolated by the TAP tag
procedure. Second, surprisingly little overlap of the data was
observed when results from similar bait proteins were compared 58
between the two studies or between the MS and previous yeast
two-hybrid studies. Although a variety of technical explanations have
been advanced to explain this discrepancy, it is important to note that
the 'interactome' is potentially very large, growing with the square of
the number of proteins involved, and that it remains substantially
undersampled. Third, both projects reported results consistent with
previous literature for already known complexes. As in other
large-scale projects, higher accuracy can be obtained with more detailed
experiments, for example, in a 'complex walking' strategy in which
complexes are tagged and identified sequentially 59 .
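
To make the scaling argument concrete, the following minimal Python
sketch counts the possible pairwise interactions among n proteins,
n(n-1)/2, which grows roughly with the square of n. The function name
and the example proteome sizes are illustrative assumptions, not figures
from the studies discussed, and real interactomes are expected to be far
sparser than this upper bound.

```python
def possible_pairs(n_proteins: int) -> int:
    """Upper bound on distinct binary interactions among n proteins."""
    if n_proteins < 0:
        raise ValueError("n_proteins must be non-negative")
    return n_proteins * (n_proteins - 1) // 2


if __name__ == "__main__":
    # Illustrative proteome sizes (assumed values, not taken from the text).
    for n in (100, 1_000, 6_000):
        print(f"{n:>6} proteins -> {possible_pairs(n):>12,} possible pairs")
```

Because the bound grows quadratically, a tenfold larger set of proteins
has about a hundredfold more candidate pairs, which is one way to see
why independent screens can each sample only a small, weakly overlapping
fraction of the space.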

In the future, quantitative methods based on stable-isotope
labelling are likely to revolutionize the study of stable or transient 
interactions and interactions dependent on PTMs. In such 
experiments, accurate quantification by means of stable-isotope 
labelling is not used for protein quantification per se; instead the 
stable-isotope ratios distinguish between the protein composition of 
two or more protein complexes. In the case of a sample containing a 
complex and a control sample containing only contaminating 
proteins (for example, immunoprecipitation with an irrelevant
antibody or isolate from a cell devoid of affinity-tagged protein), the
method can distinguish between true complex components and
nonspecifically associated proteins. In the case of complexes isolated
from cells at different states (for example, activated and
non-activated cells) the method can identify dynamic changes in the
composition of protein complexes 60,61 .
The ability of quantitative MS to detect specific complex
components within a background of nonspecifically associated proteins
increases the tolerance for high background and allows for fewer
purification steps and less stringent washing conditions, thus 
increasing the chance of finding transient and weak interactions. The 
same methods can be used to study the interaction of proteins with 
nucleic acids, small molecules and in fact with any other substrate. 
For example, drugs can be used as affinity baits in the same way as 
proteins to define their cellular targets, and small molecules such as 
co-factors can be used to isolate interesting 'sub-proteomes' 62 .
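
The ratio-based discrimination described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the function name, the fixed threshold of 3.0 and the example ratios are
hypothetical, and in a real experiment the cut-off would be derived from
the measured ratio distribution and from replicates.

```python
from typing import Dict, List, Tuple


def classify_by_ratio(
    ratios: Dict[str, float], threshold: float = 3.0
) -> Tuple[List[str], List[str]]:
    """Split proteins into specific and background binders.

    ratios maps protein name -> (bait sample signal / control signal),
    as measured from stable-isotope-labelled peptide pairs.
    threshold is an assumed cut-off, not a value from the literature.
    """
    specific: List[str] = []
    background: List[str] = []
    for protein, ratio in sorted(ratios.items()):
        if ratio <= 0:
            raise ValueError(f"ratio for {protein} must be positive")
        (specific if ratio >= threshold else background).append(protein)
    return specific, background


if __name__ == "__main__":
    # Hypothetical ratios: contaminants sit near 1, true partners well above.
    example = {"bait": 25.0, "partner_A": 12.4, "keratin": 1.1, "hsp70": 0.9}
    hits, noise = classify_by_ratio(example)
    print("specific:", hits)
    print("background:", noise)
```

Proteins that bind the affinity matrix regardless of the bait appear at
a ratio near one and are set aside, so the purification itself can be
gentler without the contaminants being mistaken for complex members.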

MS-based proteomics is not limited to the analysis of complexes 
consisting of only a few proteins. In fact, some of the most
biologically informative results have come from the analysis of large protein
complexes: 'molecular machines', organelles and subcompartments
of the cell. The first complex analysed in this way was the
spliceosome, studied in yeast 63 and then in human cells 64 , closely
followed by the yeast nuclear pore complex 65 . Re-analysis of the
spliceosome using more complete databases and more advanced
instrumentation has recently been undertaken. In one study, nearly
300 proteins were found, and evidence from sequence analysis
highlighted a set of 55 novel proteins involved in splicing and RNA
processing 66 . A similar study, using an elegant RNA tag-based
purification, also discovered many new proteins 67 . Both studies found
essentially the complete list of known human splicing factors. The 
new data encompassed and extended the original results, indicating 
the maturity of MS-based methods for the analysis of such complex 
structures. The next challenge will now be to study the dynamics and 
assembly of functional protein modules via quantitative proteomics. 

Numerous other large complexes and organelles have now at least 
been partly characterized by MS 68 . The limiting factor in such
experiments is no longer primarily the analysis, but rather the ability to purify
such structures to homogeneity. For example, it is very difficult to
isolate structures such as the Golgi apparatus, and the interpretation of
results from samples of dubious quality and definition is
correspondingly vague. The largest organelle mapped so far is the human
nucleolus, whose high specific density allows for a simple, efficient
purification 45 (Fig. 4). By using a variety of mass spectrometric
techniques, more than 400 nucleolar proteins have now been identified in
this structure. Well-characterized proteins identified in this study, but
not previously known to be associated with the nucleolus, raise interesting
questions about the function of this organelle, while the identification 
of a large number of previously uncharacterized gene products places 
many of those in the context of nucleolar function. 

At the same time, some of the previously known nucleolar
proteins, such as Werner's syndrome protein, have not yet been found,
indicating that even this large-scale study is not yet complete. One 
reason for this is that numerous factors, including Werner's 
syndrome protein, exhibit either cell-cycle dependent or facultative 
interactions with nucleoli. Dynamic imaging studies of the nucleus
also make it clear that many of the factors in the nucleolus are 
associated only transiently with this organelle 69 , a fact reflected in the 
overlapping 'cast of characters' of several nuclear bodies studied. Just 
as the single protein/single function concept is turning out to be more 
the exception than the rule, the concept of a single subcellular
location of a protein may also turn out to be a gross over-simplification.

Applying proteomics to the analysis of protein modifications 

Proteins are converted to their mature form through a complicated 
sequence of post-translational protein processing and 'decoration' 
events. Many of the PTMs are regulatory and reversible, most notably 
protein phosphorylation, which controls biological function 
through a multitude of mechanisms. Mass spectrometric methods to 
determine the type and site of such modifications on single, purified 
proteins have been refined over the past two decades. In this case, 
peptide mapping with different enzymes is usually used to 'cover' as
much of the protein sequence as possible. Protein modifications are
then determined by examining the measured mass and fragmentation
spectra via manual or computer-assisted interpretation. For the
analysis of some types of PTMs, specific mass spectrometric
techniques have been developed that scan the peptides derived from a
protein for the presence of a particular modification. The analysis of
regulatory modifications, in particular protein phosphorylation, is
complicated by the frequently low stoichiometry, the size and
ionizability of peptides bearing the modifications, and their
fragmentation behaviour in the mass spectrometer (reviewed in
refs 4,70,71). The analysis of the modification state of a purified
protein therefore remains a challenging analytical endeavour.

Recently, attempts have been made to define modifications on a 
proteome-wide scale. Given the difficulties of identifying all
modifications even in a single protein, it is clear that, at present, scanning for
proteome-wide modifications is not comprehensive. Nevertheless, a
large amount of biologically useful information can, in principle, be
generated by this approach. One of the strategies used is essentially an
extension of the approach used to analyse protein mixtures 72 . Instead
of searching the database only for non-modified peptides, the
database search algorithm is instructed to also match potentially
modified peptides. To avoid a 'combinatorial explosion' resulting
from the need to consider all possible modifications for all peptides in
the database, the experiment is usually divided into identification of a
set of proteins on the basis of non-modified peptides, followed by 
searching only these proteins for modified peptides 72 . 
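
The size of the search space in such a two-pass strategy can be
illustrated with a small Python sketch. It is a simplification under
stated assumptions: only one modification (phosphorylation on S, T or Y)
is considered, each site is either modified or not, and the toy database
and all names are hypothetical rather than taken from any search engine.

```python
from itertools import product
from typing import Dict, Iterator, List, Set, Tuple

# Residues assumed modifiable in this toy example (phosphorylation only).
MODIFIABLE = {"S", "T", "Y"}


def modified_forms(peptide: str) -> Iterator[Tuple[int, ...]]:
    """Yield each on/off pattern as a tuple of modified positions."""
    sites = [i for i, aa in enumerate(peptide) if aa in MODIFIABLE]
    for pattern in product((False, True), repeat=len(sites)):
        yield tuple(s for s, on in zip(sites, pattern) if on)


def search_space(peptides: List[str]) -> int:
    """Number of candidate forms (unmodified plus modified) to score."""
    return sum(2 ** sum(aa in MODIFIABLE for aa in p) for p in peptides)


def second_pass_peptides(
    database: Dict[str, List[str]], identified: Set[str]
) -> List[str]:
    """Keep only peptides of proteins identified in the first pass."""
    return [p for prot in sorted(identified) for p in database[prot]]


# Hypothetical three-protein database of tryptic peptides.
TOY_DB: Dict[str, List[str]] = {
    "protA": ["SSTYK", "AGLK"],
    "protB": ["TTSSYR"],
    "protC": ["YSTSK"],
}

if __name__ == "__main__":
    everything = [p for peps in TOY_DB.values() for p in peps]
    print("all proteins, all forms:", search_space(everything))
    restricted = second_pass_peptides(TOY_DB, {"protA"})
    print("identified proteins only:", search_space(restricted))
```

Each additional modifiable residue doubles the number of forms of a
peptide, so restricting the modified search to proteins already
identified from unmodified peptides keeps the candidate list manageable.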

A more functionally oriented strategy focuses on the search for one 
type of modification on all the proteins present in a sample. Such
techniques are based usually on some form of affinity selection that is
specific for the modification of interest and which is used to purify the
sub-proteome bearing this modification. For example, Pandey et al.
stimulated cells with epidermal growth factor and isolated newly
phosphotyrosine-modified proteins using antibodies specific to
phosphotyrosine 73,74 . Efforts to determine the 'phosphoproteome' in a single
step 3 ** have used chemical modification combined with affinity 
selection. In a potentially powerful technique, Ficarro et al. 75 esterified
peptide mixtures, thereby nullifying negatively charged carboxyl
groups, and then captured phosphopeptides on metal affinity columns.
This approach overcomes the low specificity of these columns caused by
their affinity for any negatively charged peptides and seems to
significantly improve the capture of phosphopeptides. Further
development of these and related techniques may allow study of the complete
phosphoproteome in multiprotein complexes and the pattern of the
more abundant phosphopeptides in whole cells, a promising approach
to study the activation state of whole signalling networks. 

Gygi et al. have used affinity purification to capture the
ubiquitinated proteins of yeast cells (ref. 76 and S. P. Gygi, personal
communication). Over 1,000 such proteins were identified and in
more than 100 cases the site of ubiquitination was determined. These
results open up the study of ubiquitinated substrates in a cell state-
and protein complex-dependent manner. 

Many challenges remain in the large-scale mapping of PTMs, but it 
is clear that MS-based proteomics can make a unique contribution in 
this area. For example, systematic quantitative measurements of PTMs 
by stable-isotope labelling would be of tremendous biological interest.

Challenges, expectations and emerging technologies 

Proteomics, in particular quantitative proteomics, can be viewed as
an array of biological or clinical assays capable of probing most, if not 
all, of the proteins in a sample. As proteins are involved in essentially 
all biological functions and clinical conditions, MS and proteomics 
will have an even greater impact on biology and medicine than they have
had so far.

Over the past decade, MS of single proteins or protein complexes 
has been successful to the point where it is now considered a
mainstream technology. This technology interfaces particularly well with
biochemical and cell biological studies for studying specific protein
functions. The success is built on the proven potential of mass
spectrometric techniques to rapidly identify almost any protein, to
analyse that protein for the presence of PTMs, to determine how and
with what other biomolecules that protein interacts, and even to
gain structural information about the protein from gas-phase
experiments 77,78 and from experiments in which mass spectrometric
characterization of proteins has interfaced with X-ray
crystallography 79 . As analytical methods and instrumentation are improving
constantly, MS can be used to address an increasing number of
analytical problems facing biochemists, geneticists and cell biologists. But
protein MS does not equal proteomics. The specific objective of
proteomics is to concurrently identify, quantify and analyse a large
number of proteins in a functional context. This shift in focus from
the analysis of selected isolated proteins to proteome-wide analyses 
has a number of profound implications and poses as yet unmet 
challenges for every aspect of experimental biology. These include 
experimental design, data analysis, visualization and storage, 
organization of proteomics research groups and publication of 
proteomic data.

Experimental design 

In a typical protein MS experiment, a specific property (for example, 
sequence, PTM or interaction) of a partly characterized protein is 
examined. In contrast, proteomic experiments often collect large 
amounts of data in the absence of hypotheses concerning specific 
proteins or activities. Proteomic experiments, therefore, have to be 
designed in ways that maximize the likelihood of generating new
discoveries, or at least new testable hypotheses. The technology of gene
expression profiling is conceptually similar to proteomic profiling
and has demonstrated that more information is better. Although it is
essentially impossible to draw meaningful conclusions from a single
quantitative gene expression profile, the availability of multiple
profiles from related samples allows the application of statistical
tools 80 to extract signature patterns containing diagnostic or
functional information. Therefore, successful proteomics experiments
need to be designed in such a way that they can take advantage of the
power of statistics for data interpretation. To achieve this goal,
carefully controlled repeat studies and the generation of models describing
the source, magnitude and distribution of errors will be essential. 

Data collection 

Proteomic studies necessarily result in large amounts of data. Data 
collection at a volume and quality that is consistent with the use of
statistical methods is a significant limitation of proteomics today. In a
typical LC-MS/MS experiment, approximately 1,000 CID spectra 
can be acquired per hour. Even with the optimistic assumption that 
every one of these spectra leads to the successful identification of a 
peptide, it would take a long time to analyse complete proteomes. 
High-throughput collection of consistently high-quality data
therefore remains a challenge in proteomics. We have argued that one
solution to the problem would be to establish a number of specialized 
and generally accessible data-collection centres 81 , akin to the beam 
lines used by X-ray crystallographers for protein structural studies. 
Such centres would not only generate data of consistent quality for a 
large number of proteomics projects, but would also serve as
disseminators of advanced technology.
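
The throughput limitation can be put into rough numbers with a short
Python calculation. Only the rate of about 1,000 CID spectra per hour
comes from the text; the proteome size and the number of peptides per
protein below are assumed, purely illustrative values, and the function
name is hypothetical.

```python
def acquisition_hours(n_spectra: int, spectra_per_hour: float = 1000.0) -> float:
    """Instrument time needed if every spectrum yields one identification."""
    if n_spectra < 0 or spectra_per_hour <= 0:
        raise ValueError("need n_spectra >= 0 and spectra_per_hour > 0")
    return n_spectra / spectra_per_hour


if __name__ == "__main__":
    # Assumed, illustrative proteome: 10,000 proteins, each to be covered
    # by 10 distinct peptides, with one spectrum per peptide.
    n_proteins, peptides_per_protein = 10_000, 10
    hours = acquisition_hours(n_proteins * peptides_per_protein)
    print(f"{hours:.0f} h of acquisition, about {hours / 24:.1f} days")
```

Even under the optimistic assumption of one identification per
spectrum, a single sample at this scale would occupy an instrument for
days, before any replicates needed for statistical analysis.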

Data analysis, visualization and storage 

The analysis and interpretation of the enormous volumes of
proteomic data remains an unsolved challenge, particularly for 
gel-free approaches. Expert manual analysis is incompatible with the 
tens of thousands of spectra collected in a single experiment and is 
inconsistent. Therefore, the development of transparent tools for the 
analysis of proteomic data using statistical principles is a key
challenge 41,42 . Only once such tools are tested, validated and widely 
accepted will it become feasible to apply quality standards for protein 
identification, quantification and other measurements and to 
compare complementary proteomic data sets generated in different
laboratories. These comparisons will also depend critically on 
transparent file structures for data storage, communication and 
visualization. The development of such proteomics tools is still in its
infancy. 

Data publication 

The publication of the large data sets generated by proteomic 
experiments and the information contained therein poses significant
challenges. At present, most proteomics publications consist of an
experimental description, a data table (typically published as
supplementary material containing a partially interpreted and validated
summary of the data) and an in-depth validation and discussion of
one to a few conclusions made from the data. To make publication of
proteomics data more useful, publishers and journals need to find
new ways to review large data sets, to validate their contents and to 
make the information contained therein electronically searchable;
this problem remains essentially unsolved, despite preliminary
developments by a few publishers and journals 82 .

© 2003 Nature Publishing Group

In spite of these and other challenges, the impact of proteomics on 
clinical and biological research is growing rapidly. It seems that 
beyond its great current contribution to cell biology, proteomics may 
have a huge influence on clinical diagnosis. MS-based proteomics 
seems capable of detecting patterns of differentially expressed 
proteins in easily accessible clinical samples such as blood serum. 
These types of analyses have the potential to diagnose the presence 
and stage of many diseases, in particular cancers 83 . Clinical diagnosis 
will be further advanced with the advent of mass spectrometers with 
higher mass accuracy, dynamic range and resolution, and with the 
ability to identify specific sequences of diagnostic analytes and the
use of accurate quantification procedures. 

MS-based proteomics is still an emerging technology where 
revolutionary change is possible. Several concepts have been 
proposed and are under development that have the potential to alter 
the landscape of current MS-based proteomic technologies. One of 
these is the analysis of intact proteins. The currency of essentially all 
MS-based identifications is peptides. The convergence of mass 
spectrometers with large mass ranges, extremely high mass accuracy 
and resolution, and ionization/fragmentation methods compatible 
with large proteins has catalysed the emergence of whole-protein 
proteomics 84 . The analysis of whole proteins with high accuracy has 
the potential to distinguish and characterize differentially modified 
forms and to provide insights into coordinated modification 
patterns that are difficult to establish by peptide analysis. 

A second emerging concept is mass spectrometric tissue 
imaging 85 . In this technique, thin tissue sections are directly applied 
to a MALDI mass spectrometer and, after treating the samples with a 
suitable matrix, profiles of the proteins contained in the section are 
generated by 'imaging' the sample with an array of mass spectra. The 
method, while currently incapable of identifying the detected protein
features, has already provided proof-of-principle that clinically 
diagnostic patterns can be generated. Increased spatial resolution, 
potentially to subcellular levels, improved software tools and
automated sample preparation will further increase the utility of this
technique for clinical diagnosis and classification. 

A third concept is the use of mass tags measured in mass 
spectrometers of very high mass accuracy and resolution such as 
FT-MS instruments. These mass tags could be used potentially for 
high-throughput protein identification. The idea is based on the 
observation that a particular proteome, if digested with a specific 
enzyme such as trypsin, will generate a peptide mixture in which 
most peptides can be uniquely classified based on their accurate 
mass and some other parameters such as chromatographic 
coordinates 86,87 . Therefore, once the peptides are identified by 
MS/MS and annotated with accurate mass tags they can be identified 
in subsequent experiments simply by correlating the accurate mass 
and the separation coordinates with the list of previously determined 
mass tags. 
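
The lookup step of the accurate mass tag idea can be sketched as
follows. This is a minimal illustration, not the published
implementation: the class and function names, the tolerances (5 ppm in
mass, 0.02 in normalized elution position) and the three example tags
with their masses are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MassTag:
    peptide: str      # sequence previously identified by MS/MS
    mass: float       # accurate mass in daltons
    retention: float  # normalized elution position, 0..1


def match_tag(
    mass: float,
    retention: float,
    tags: List[MassTag],
    ppm_tol: float = 5.0,
    rt_tol: float = 0.02,
) -> Optional[MassTag]:
    """Return the single tag within both tolerances, otherwise None."""
    hits = [
        t for t in tags
        if abs(mass - t.mass) / t.mass * 1e6 <= ppm_tol
        and abs(retention - t.retention) <= rt_tol
    ]
    # An ambiguous or empty result is reported as "no identification".
    return hits[0] if len(hits) == 1 else None


# Hypothetical tag list; two tags are too close in mass to separate
# at 5 ppm and are distinguished only by their elution position.
EXAMPLE_TAGS = [
    MassTag("PEPTIDEA", 1000.5000, 0.30),
    MassTag("PEPTIDEB", 1000.5030, 0.70),
    MassTag("PEPTIDEC", 1500.7500, 0.31),
]

if __name__ == "__main__":
    print(match_tag(1000.5020, 0.69, EXAMPLE_TAGS))
    print(match_tag(1000.5015, 0.50, EXAMPLE_TAGS))
```

The first query lies within the mass tolerance of two tags but matches
only one of them once the separation coordinate is taken into account,
which is why the approach combines accurate mass with a second
parameter rather than relying on mass alone.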

Conclusion and perspective 

In studying a biological system using the biochemical approach, 
researchers have traditionally attempted to purify to homogeneity 
each of the system's components; each element is then studied in
detail with the ultimate aim being to reconstitute the system in vitro 
from the isolated components. Because proteins carry out most 
biological activities, the biochemical approach has been significantly 
enhanced by the availability of the sensitive and rapid MS-based 
protein identification methods discussed in this article. The 
availability of complete genomic sequences from a number of species 
further facilitates MS-based protein identifications, as the
requirement for de novo sequencing has been superseded by simple correlation
of measured data versus theoretical data predicted from sequence
databases. The availability of completely sequenced genomes also
catalysed the emergence of systems biology, the attempt to
systematically study all the concurrent physiological processes in a cell or
tissue by global measurement of differentially perturbed states
(Fig. 5). The ultimate goal of systems biology is the integration of data
from these observations into models that might, eventually, 
represent and simulate the physiology of the cell 88 . 

Proteomics is an essential component of systems biology research 
because proteins are rich in information that has turned out to be 
extremely valuable for the description of biological processes. These 
include protein abundances, linkage maps to other proteins or to
other types of biomolecules including DNA and lipids, activities, 
modification states, subcellular location and more. Unfortunately, 
with the exception of quantitative protein profiles and
protein-protein interactions (keeping in mind the caveats discussed above), none
of these properties can currently be measured systematically, 
quantitatively and with high throughput. But rapid advances in 
technology suggest that this limitation may be transient. The few 
studies where the same biological system was subjected to different 
types of systematic measurements already offer insights into the 
power of the method. For instance, mRNA expression profiles and 
protein expression profiles seem to be largely complementary and 
therefore contribute to a more refined description of the system that 
each observation by itself is unable to provide 88 .

Extrapolating from these limited studies, we expect that 
combining different genomic and proteomic results obtained from 
the same biological system will substantially increase our 
understanding of complex biological processes. More specifically, 
the systems biology studies based on diverse and high-quality 
proteomic data will define functional biological modules, reveal 
previously unrecognized connections between biochemical 
processes and modules, and generate new hypotheses that can be
tested either by traditional methods or by the targeted generation of 
more genomic and proteomic data 51,88-90 .